In this example we present a model for identifying fake profiles on Instagram.
Why it matters:
Fake profiles can damage the company's reputation.
They can cost money to users whose accounts are impersonated.
Detecting them helps prevent online harassment.
It supports better handling of fake news.
It keeps the company from losing money by lowering operating costs.
We will use a dataset of Instagram accounts to build a data dashboard that summarizes its contents and makes them easier to visualize.
The stages are:
Understand the context of the data and the problem to solve.
Collect and process the data.
Run an exploratory analysis of the data.
Choose the best model for the problem.
Fit the model.
Validate the results.
Interpret and analyze the results.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# The test and training sets are stored as separate CSV files in the same repository.
base_url = "https://raw.githubusercontent.com/Andres1984/Data-Analysis-with-R/master/Bases/"
atest = pd.read_csv(base_url + "atest.csv")
atrain = pd.read_csv(base_url + "atrain.csv")
Here you can see the original dataset. Look closely at the column names and values.
You will probably find it hard to tell from the raw table how useful this dataset is.
The first step is to generate a set of charts that tells us more about the variables and how they relate to each other.
atest
| | profile pic | nums/length username | fullname words | nums/length fullname | name==username | description length | external URL | private | #posts | #followers | #follows | fake |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.33 | 1 | 0.33 | 1 | 30 | 0 | 1 | 35 | 488 | 604 | 0 |
| 1 | 1 | 0.00 | 5 | 0.00 | 0 | 64 | 0 | 1 | 3 | 35 | 6 | 0 |
| 2 | 1 | 0.00 | 2 | 0.00 | 0 | 82 | 0 | 1 | 319 | 328 | 668 | 0 |
| 3 | 1 | 0.00 | 1 | 0.00 | 0 | 143 | 0 | 1 | 273 | 14890 | 7369 | 0 |
| 4 | 1 | 0.50 | 1 | 0.00 | 0 | 76 | 0 | 1 | 6 | 225 | 356 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 1 | 0.29 | 1 | 0.00 | 0 | 0 | 0 | 0 | 13 | 114 | 811 | 1 |
| 116 | 1 | 0.40 | 1 | 0.00 | 0 | 0 | 0 | 0 | 4 | 150 | 164 | 1 |
| 117 | 1 | 0.00 | 2 | 0.00 | 0 | 0 | 0 | 0 | 3 | 833 | 3572 | 1 |
| 118 | 0 | 0.17 | 1 | 0.00 | 0 | 0 | 0 | 0 | 1 | 219 | 1695 | 1 |
| 119 | 1 | 0.44 | 1 | 0.00 | 0 | 0 | 0 | 0 | 3 | 39 | 68 | 1 |
120 rows × 12 columns
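Before plotting anything, a few basic checks (shape, missing cells, class balance) help make sense of a table like this. The sketch below is only illustrative: `toy` holds four rows and four columns copied from the table above, and `quick_overview` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Four rows (index 0, 1, 118, 119) and four columns copied from the atest table above.
toy = pd.DataFrame({
    "profile pic": [1, 1, 0, 1],
    "#followers": [488, 35, 219, 39],
    "#follows": [604, 6, 1695, 68],
    "fake": [0, 0, 1, 1],
})


def quick_overview(df, target="fake"):
    """Return basic facts worth checking before any plotting (hypothetical helper)."""
    return {
        "shape": df.shape,
        "missing_cells": int(df.isna().sum().sum()),
        "class_counts": df[target].value_counts().sort_index().to_dict(),
    }


overview = quick_overview(toy)
print(overview)
```

The same helper could be called on `atest` or `atrain` once they are loaded.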
This dataset contains historical information about Instagram accounts.
A recurring problem is the number of fake accounts that keep appearing.
With the information below, we will try to predict whether an account is fake.
Some of the variables are:
profile pic: Whether the account has a profile picture (1 = yes, 0 = no).
nums/length username: Ratio of numeric characters to the length of the username. Ranges from 0 to 1.
fullname words: Number of words in the full name.
nums/length fullname: Ratio of numeric characters to the length of the full name. Ranges from 0 to 1.
name==username: Whether the full name is identical to the username (1 = yes, 0 = no).
description length: Length of the account description (bio), in characters.
external URL: Whether the account has an external web address (1 = yes, 0 = no).
private: Whether the account is private (1 = yes, 0 = no).
#posts: Number of posts.
#followers: Number of followers.
#follows: Number of accounts the user follows.
fake: Whether the account is fake (1 = yes, 0 = no).
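To make the ratio and count variables concrete, here is a minimal sketch of how they might be derived from raw profile text. The dataset ships these columns precomputed, so the functions below are hypothetical, and the exact rules (for example, whether spaces count toward the length of the full name) are assumptions.

```python
def digit_ratio(text):
    """Share of characters in `text` that are digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def profile_features(username, full_name, bio):
    """Build a few of the dataset's columns from raw strings (assumed definitions)."""
    return {
        "nums/length username": round(digit_ratio(username), 2),
        "fullname words": len(full_name.split()),
        "nums/length fullname": round(digit_ratio(full_name), 2),
        "name==username": int(full_name == username),
        "description length": len(bio),
    }


# Made-up profile, used only to exercise the functions.
example = profile_features("maria_1984", "Maria Lopez", "Coffee and books")
print(example)
```

Here four of the ten username characters are digits, so the first feature comes out as 0.4.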
The first thing to do is look at how the variables relate to each other and how they can be organized.
Below are two ways to create reports. The first is written to a separate HTML file (ReportSweetviz.html).
The second is displayed within this document. The goal of these reports is to show, in a summarized and interactive form, what the dataset contains.
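# Move the target to the first column: build an indicator column from "fake",
# place it in front of the remaining columns, and restore its name.
df_one = pd.get_dummies(atrain["fake"], drop_first=True)  # single column named 1
df_two = pd.concat((df_one, atrain), axis=1)
df_two = df_two.drop(["fake"], axis=1)  # drop the original copy of the target
df_sw = df_two.rename(columns={1: "fake"})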
#!pip install sweetviz
import sweetviz
my_report = sweetviz.analyze([df_sw, "atrain"],target_feat='fake')
my_report.show_html('ReportSweetviz.html')
Report ReportSweetviz.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
from dataprep.eda import create_report
create_report(atrain)
| Dataset statistic | Value |
|---|---|
| Number of Variables | 12 |
| Number of Rows | 576 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 2 |
| Duplicate Rows (%) | 0.3% |
| Total Size in Memory | 54.1 KB |
| Average Row Size in Memory | 96.2 B |
| Variable Types | Categorical: 6, Numerical: 6 |
| Insight | Type |
|---|---|
| nums/length username is skewed | Skewed |
| nums/length fullname is skewed | Skewed |
| description length is skewed | Skewed |
| #posts is skewed | Skewed |
| #followers is skewed | Skewed |
| #follows is skewed | Skewed |
| profile pic has constant length 1 | Constant Length |
| name==username has constant length 1 | Constant Length |
| external URL has constant length 1 | Constant Length |
| private has constant length 1 | Constant Length |
| fake has constant length 1 | Constant Length |
| nums/length username has 299 (51.91%) zeros | Zeros |
| nums/length fullname has 518 (89.93%) zeros | Zeros |
| description length has 326 (56.6%) zeros | Zeros |
| #posts has 157 (27.26%) zeros | Zeros |
profile pic (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 37.1 KB |
| Mean | 1 |
|---|---|
| Standard Deviation | 0 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 1 |
| 1st row | 1 |
|---|---|
| 2nd row | 1 |
| 3rd row | 1 |
| 4th row | 1 |
| 5th row | 1 |
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 576 |
nums/length username (numerical)
| Approximate Distinct Count | 54 |
|---|---|
| Approximate Unique (%) | 9.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 9.0 KB |
| Mean | 0.1638 |
| Minimum | 0 |
| Maximum | 0.92 |
| Zeros | 299 |
| Zeros (%) | 51.9% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0 |
| Q3 | 0.31 |
| 95-th Percentile | 0.57 |
| Maximum | 0.92 |
| Range | 0.92 |
| IQR | 0.31 |
| Mean | 0.1638 |
|---|---|
| Standard Deviation | 0.2141 |
| Variance | 0.04584 |
| Sum | 94.37 |
| Skewness | 1.2596 |
| Kurtosis | 1.0597 |
| Coefficient of Variation | 1.3068 |
fullname words (categorical)
| Approximate Distinct Count | 9 |
|---|---|
| Approximate Unique (%) | 1.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 37.1 KB |
| Mean | 1.0035 |
|---|---|
| Standard Deviation | 0.05887 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 2 |
| 1st row | 0 |
|---|---|
| 2nd row | 2 |
| 3rd row | 2 |
| 4th row | 1 |
| 5th row | 2 |
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 578 |
nums/length fullname (numerical)
| Approximate Distinct Count | 25 |
|---|---|
| Approximate Unique (%) | 4.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 9.0 KB |
| Mean | 0.03609 |
| Minimum | 0 |
| Maximum | 1 |
| Zeros | 518 |
| Zeros (%) | 89.9% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0 |
| Q3 | 0 |
| 95-th Percentile | 0.31 |
| Maximum | 1 |
| Range | 1 |
| IQR | 0 |
| Mean | 0.03609 |
|---|---|
| Standard Deviation | 0.1251 |
| Variance | 0.01566 |
| Sum | 20.79 |
| Skewness | 4.4251 |
| Kurtosis | 23.7269 |
| Coefficient of Variation | 3.4665 |
name==username (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 37.1 KB |
| Mean | 1 |
|---|---|
| Standard Deviation | 0 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 1 |
| 1st row | 0 |
|---|---|
| 2nd row | 0 |
| 3rd row | 0 |
| 4th row | 0 |
| 5th row | 0 |
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 576 |
description length (numerical)
| Approximate Distinct Count | 104 |
|---|---|
| Approximate Unique (%) | 18.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 9.0 KB |
| Mean | 22.6233 |
| Minimum | 0 |
| Maximum | 150 |
| Zeros | 326 |
| Zeros (%) | 56.6% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0 |
| Q3 | 34 |
| 95-th Percentile | 122 |
| Maximum | 150 |
| Range | 150 |
| IQR | 34 |
| Mean | 22.6233 |
|---|---|
| Standard Deviation | 37.703 |
| Variance | 1421.5152 |
| Sum | 13031 |
| Skewness | 1.8619 |
| Kurtosis | 2.6492 |
| Coefficient of Variation | 1.6666 |
external URL (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 37.1 KB |
| Mean | 1 |
|---|---|
| Standard Deviation | 0 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 1 |
| 1st row | 0 |
|---|---|
| 2nd row | 0 |
| 3rd row | 0 |
| 4th row | 0 |
| 5th row | 0 |
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 576 |
private (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 37.1 KB |
| Mean | 1 |
|---|---|
| Standard Deviation | 0 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 1 |
| 1st row | 0 |
|---|---|
| 2nd row | 0 |
| 3rd row | 1 |
| 4th row | 0 |
| 5th row | 1 |
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 576 |
#posts (numerical)
| Approximate Distinct Count | 193 |
|---|---|
| Approximate Unique (%) | 33.5% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 9.0 KB |
| Mean | 107.4896 |
| Minimum | 0 |
| Maximum | 7389 |
| Zeros | 157 |
| Zeros (%) | 27.3% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 9 |
| Q3 | 81.5 |
| 95-th Percentile | 489.5 |
| Maximum | 7389 |
| Range | 7389 |
| IQR | 81.5 |
| Mean | 107.4896 |
|---|---|
| Standard Deviation | 402.0344 |
| Variance | 161631.6834 |
| Sum | 61914 |
| Skewness | 12.9524 |
| Kurtosis | 210.0347 |
| Coefficient of Variation | 3.7402 |
#followers (numerical)
| Approximate Distinct Count | 372 |
|---|---|
| Approximate Unique (%) | 64.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 9.0 KB |
| Mean | 85307.2361 |
| Minimum | 0 |
| Maximum | 1.5339e+07 |
| Zeros | 18 |
| Zeros (%) | 3.1% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 2 |
| Q1 | 39 |
| Median | 150.5 |
| Q3 | 716 |
| 95-th Percentile | 13264.25 |
| Maximum | 1.5339e+07 |
| Range | 1.5339e+07 |
| IQR | 677 |
| Mean | 85307.2361 |
|---|---|
| Standard Deviation | 910148.4577 |
| Variance | 8.2837e+11 |
| Sum | 4.9137e+07 |
| Skewness | 13.6434 |
| Kurtosis | 200.1978 |
| Coefficient of Variation | 10.6691 |
#follows (numerical)
| Approximate Distinct Count | 400 |
|---|---|
| Approximate Unique (%) | 69.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 9.0 KB |
| Mean | 508.3819 |
| Minimum | 0 |
| Maximum | 7500 |
| Zeros | 11 |
| Zeros (%) | 1.9% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 5.75 |
| Q1 | 57.5 |
| Median | 229.5 |
| Q3 | 589.5 |
| 95-th Percentile | 1860 |
| Maximum | 7500 |
| Range | 7500 |
| IQR | 532 |
| Mean | 508.3819 |
|---|---|
| Standard Deviation | 917.9812 |
| Variance | 842689.5547 |
| Sum | 292828 |
| Skewness | 4.7127 |
| Kurtosis | 28.1787 |
| Coefficient of Variation | 1.8057 |
fake (categorical)
| Approximate Distinct Count | 2 |
|---|---|
| Approximate Unique (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 37.1 KB |
| Mean | 1 |
|---|---|
| Standard Deviation | 0 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 1 |
| 1st row | 0 |
|---|---|
| 2nd row | 0 |
| 3rd row | 0 |
| 4th row | 0 |
| 5th row | 0 |
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 576 |
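The headline figures in a profile report like the one above (zero counts, median, IQR, skewness) can be recomputed with plain pandas. The sketch below uses a made-up heavy-tailed series as a stand-in for a column such as `#followers`; `column_profile` is a hypothetical helper, and the report's skewness estimator may differ slightly from the one in `Series.skew()`. It also shows that a `log1p` transform reduces the skew of such a count column, which is one common way to handle the "Skewed" insights.

```python
import numpy as np
import pandas as pd

# Made-up heavy-tailed counts, standing in for a column such as "#followers".
followers = pd.Series([0, 0, 2, 39, 150, 716, 13264, 1_500_000], name="#followers")


def column_profile(s):
    """Recompute a few report-style statistics for one numeric column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "zeros": int((s == 0).sum()),
        "zeros_pct": round(100 * float((s == 0).mean()), 1),
        "median": float(s.median()),
        "iqr": float(q3 - q1),
        "skewness": float(s.skew()),
    }


profile = column_profile(followers)
# log1p compresses the long right tail while keeping zeros at zero.
log_skewness = float(np.log1p(followers).skew())
print(profile)
print(log_skewness)
```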
The chart below is organized as follows:
The red box contains the fake accounts, labeled 1. The blue box contains the non-fake accounts, labeled 0.
Inside the blue box, every account has a profile picture, which is why only the value 1 appears.
Inside the red box, not every account has a profile picture, which is why the box is split into 0 (no profile picture) and 1 (profile picture).
To gauge the impact of these fake accounts, and whether people can easily tell them apart, the next level of the chart shows the number of followers of each account.
The chart suggests that people generally do not recognize a fake account, since many users follow accounts of this kind.
# one way to present categorical data
grafico = px.treemap(df_sw, path=['fake','profile pic' ,'#followers'])
grafico.show()
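The two comparisons read off the treemap (profile picture by class, and followers by class) can also be checked numerically. This is a minimal sketch on made-up rows that reuse the dataset's column names; on the real data, `toy` would be replaced by `df_sw` or `atrain`.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names.
toy = pd.DataFrame({
    "fake":        [0, 0, 0, 1, 1, 1],
    "profile pic": [1, 1, 1, 1, 0, 0],
    "#followers":  [488, 35, 328, 114, 219, 15],
})

# Rows: fake (0/1). Columns: profile pic (0/1). Cells: number of accounts.
counts = pd.crosstab(toy["fake"], toy["profile pic"])

# The median is less sensitive than the mean to a few very large accounts.
median_followers = toy.groupby("fake")["#followers"].median()

print(counts)
print(median_followers)
```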
The first thing to keep in mind is that this is a classification problem.
The goal is to build a model that can later predict whether an account is fake.
In this example we will use a model known as a random forest.
Its advantage is robustness: it combines the results of many "mini models" (decision trees) to reach its prediction.
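The "mini models" idea can be illustrated on synthetic data. In scikit-learn, a `RandomForestClassifier` averages the class probabilities predicted by its individual trees; the sketch below recomputes that average by hand and compares it with the forest's own output. The data comes from `make_classification` and is a stand-in for, not a sample of, the Instagram dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (an assumption; not the Instagram dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in forest.estimators_; average their probabilities.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_average = tree_probs.mean(axis=0)

forest_probs = forest.predict_proba(X)
agree = np.allclose(manual_average, forest_probs)
print(agree)
```

Averaging many shallow trees, each trained on a different bootstrap sample, is what makes the combined prediction less sensitive to any single tree's mistakes.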
The following example shows how an account is classified as fake according to the variables.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
model = RandomForestClassifier(n_estimators=10, max_depth=3)
# The data already comes split into atrain and atest, so the model is fit on all of atrain
y = atrain["fake"]
# Drop the target column to obtain the feature matrix
x = atrain.drop(columns="fake")
# Train
model.fit(x, y)
# Extract a single tree from the forest to visualize it
estimator = model.estimators_[0]
fig = plt.figure(figsize=(15, 10))
plot_tree(estimator,
          feature_names=list(x.columns),
          # class_names must be a list ordered like model.classes_ (0, then 1);
          # a bare string such as "fake" is indexed per character and yields the labels "f" and "a"
          class_names=["not fake", "fake"],
          filled=True, impurity=True,
          rounded=True)
plt.show()
Tree drawn by plot_tree (value = [not fake, fake]; the left branch of each split is the one where the condition is true):
nums/length username <= 0.195, gini = 0.5, samples = 368, value = [280, 296], class = fake
    True: #follows <= 93.5, gini = 0.383, samples = 226, value = [259, 90], class = not fake
        True: #followers <= 67.5, gini = 0.435, samples = 53, value = [24, 51], class = fake
            True: gini = 0.194, samples = 38, value = [6, 49], class = fake
            False: gini = 0.18, samples = 15, value = [18, 2], class = not fake
        False: #followers <= 49.0, gini = 0.244, samples = 173, value = [235, 39], class = not fake
            True: gini = 0.0, samples = 10, value = [0, 16], class = fake
            False: gini = 0.162, samples = 163, value = [235, 23], class = not fake
    False: #posts <= 9.5, gini = 0.168, samples = 142, value = [21, 206], class = fake
        True: #followers <= 213.0, gini = 0.042, samples = 117, value = [4, 181], class = fake
            True: gini = 0.0, samples = 107, value = [0, 173], class = fake
            False: gini = 0.444, samples = 10, value = [4, 8], class = fake
        False: #followers <= 123.0, gini = 0.482, samples = 25, value = [17, 25], class = fake
            True: gini = 0.0, samples = 11, value = [0, 18], class = fake
            False: gini = 0.413, samples = 14, value = [17, 7], class = not fake
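Each node of the tree reports a `gini` value. Gini impurity is $1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in the node: 0 means the node is pure, and 0.5 is the maximum for two classes. The short sketch below recomputes it from the `value` counts printed for three of the nodes above; the function name is ours.

```python
def gini(class_counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


root = gini([280, 296])   # root node: classes almost evenly mixed
leaf = gini([6, 49])      # leaf dominated by the second class
pure = gini([0, 16])      # leaf with a single class

print(round(root, 3), round(leaf, 3), round(pure, 3))
```

The rounded results match the `gini` values reported for those nodes (0.5, 0.194 and 0.0).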
This analysis follows one path through the tree that ends with an account being classified as fake.
In the top box of the tree output above, the first split is nums/length username <= 0.195, that is, whether digits make up less than about 20% of the username.
Following the branch where that holds, the second level checks whether the account follows fewer than 94 accounts (#follows <= 93.5).
Finally, among the accounts that remain, those with fewer than 68 followers (#followers <= 67.5) end in a leaf where most samples are fake (value = [6, 49]).
Because the forest is not seeded, rerunning the notebook may produce different splits.
Try the same exercise by combining the other results in the tree.
If you are not sure how, ask an advisor to help you understand the model better.
If you were the manager of Instagram, what decisions could you make with this information?
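One way to read the remaining branches without a diagram is scikit-learn's `export_text`, which prints a fitted tree as nested if/else rules. The sketch below runs on synthetic data with placeholder feature names; with the notebook's objects, the call would take one of the forest's trees and the real column names instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data with placeholder feature names (assumptions).
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=1)
forest.fit(X, y)

# Print the first tree of the forest as indented text rules.
rules = export_text(forest.estimators_[0], feature_names=names)
print(rules)
```

Each indented line is one condition, so a full path from the first line down to a `class:` line corresponds to one chain of boxes in the plotted tree.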
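The stage list includes validating the results, and a held-out set is the usual way to do that. The sketch below shows the idea on synthetic data: fit on one part, then measure accuracy and the confusion matrix on the other. The data and the 80/20 split are assumptions for illustration; with the notebook's data, the model would be fit on `atrain` and scored on `atest`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 11 features, the same count as the Instagram predictors.
X, y = make_classification(n_samples=600, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred)

print(acc)
print(cm)
```

For a fake-account detector, the off-diagonal cells matter most: one counts genuine accounts flagged as fake, the other counts fake accounts that slipped through.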